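As with other forms of AI, speech recognition has recently been studied for performance disparities across different user cohorts. One approach to achieving fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeted at the cohorts discovered. In this paper, we report initial findings on both discovering and mitigating performance disparities, using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information with a more scalable method that uses speaker embedding technology to group speakers without human labels. For fairness mitigation, we find that oversampling underrepresented cohorts, as well as modeling speaker cohort membership through additional input variables, reduces the gap between the top- and bottom-performing cohorts without degrading overall recognition accuracy.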
translated by Google Translate
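The oversampling mitigation mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the sampling ratio or the data format, so balancing every cohort up to the size of the largest one, the `(utterance_id, cohort)` tuples, and the function name `oversample_cohorts` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def oversample_cohorts(utterances, cohort_of, seed=0):
    """Resample so every cohort has as many items as the largest cohort.

    Hypothetical helper: the balancing target is an assumption, not the
    ratio used in the paper.
    """
    rng = random.Random(seed)
    by_cohort = defaultdict(list)
    for utt in utterances:
        by_cohort[cohort_of(utt)].append(utt)

    target = max(len(items) for items in by_cohort.values())
    balanced = []
    for items in by_cohort.values():
        balanced.extend(items)
        # Draw the missing items with replacement from the same cohort.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced


# Toy data: cohort "A" has 8 utterances, cohort "B" only 2.
data = [(f"utt{i}", "A") for i in range(8)] + [(f"utt{i}", "B") for i in range(8, 10)]
balanced = oversample_cohorts(data, cohort_of=lambda u: u[1])
counts = {c: sum(1 for u in balanced if u[1] == c) for c in ("A", "B")}
```

In practice the cohort label could come either from metadata or from clustering speaker embeddings, which are the two discovery routes the abstract compares.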
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as gains on long-tail object queries, and the ability to perform zero-shot and few-shot NLQ.
Large Language Models (LLMs) have been the subject of active research, significantly advancing the field of Natural Language Processing (NLP). From BERT to BLOOM, LLMs have surpassed state-of-the-art results in various natural language tasks such as question answering, summarization, and text generation. Many ongoing efforts focus on understanding LLMs' capabilities, including their knowledge of the world, syntax, and semantics. However, extending the textual prowess of LLMs to symbolic reasoning has been slow and predominantly focused on tackling problems related to the mathematical field. In this paper, we explore the use of LLMs for automated planning - a branch of AI concerned with the realization of action sequences (plans) to achieve a goal, typically executed by intelligent agents, autonomous robots, and unmanned vehicles. We introduce Plansformer, an LLM fine-tuned on planning problems and capable of generating plans with favorable behavior in terms of correctness and length with reduced knowledge-engineering efforts. We also demonstrate the adaptability of Plansformer in solving different planning domains with varying complexities, owing to the transfer learning abilities of LLMs. For one configuration of Plansformer, we achieve ~97% valid plans, out of which ~95% are optimal for Towers of Hanoi - a puzzle-solving domain.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
End-to-end text-to-speech (TTS) systems have been developed for European languages like English and Spanish with state-of-the-art speech quality, prosody, and naturalness. However, development of end-to-end TTS for Indian languages is lagging behind in terms of quality. The challenges involved in such a task are: 1) scarcity of quality training data; 2) low efficiency during training and inference; 3) slow convergence in the case of large vocabulary size. In this paper, we investigate fine-tuning the English-pretrained Tacotron2 model with limited Sanskrit data to synthesize natural-sounding speech in Sanskrit in low-resource settings. Our experiments show encouraging results, achieving an overall MOS of 3.38 from 37 evaluators with good knowledge of spoken Sanskrit. This is a strong result given that only 2.5 hours of speech data were used.
Through their transfer learning abilities, highly-parameterized large pre-trained language models have dominated the NLP landscape for a multitude of downstream language tasks. Though linguistically proficient, the inability of these models to incorporate the learning of non-linguistic entities (numerals and arithmetic reasoning) limits their usage for tasks that require numeric comprehension or strict mathematical reasoning. However, as we illustrate in this paper, building a general purpose language model that also happens to be proficient in mathematical reasoning is not as straightforward as training it on a numeric dataset. In this work, we develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior datasets. A key difference setting apart HM3DSEM from other datasets is the use of texture information to annotate pixel-accurate object boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object Goal Navigation task using different methods. Policies trained using HM3DSEM outperform those trained on prior datasets. The introduction of HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation from 400 submissions in 2021 to 1022 submissions in 2022.
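This paper is concerned with online targeted advertising on social networks. The main technical task we address is estimating the activation probability for user pairs, which quantifies the influence one user may have on another's purchasing decisions. This is a challenging task because one marketing episode typically involves multiple marketing campaigns/strategies for multiple products. In this paper, we propose what we believe is the first tensor-based contextual bandit framework for online targeted advertising. The proposed framework is designed to accommodate any number of feature vectors in the form of a multi-mode tensor, making it possible to capture, in a unified manner, the heterogeneity that may exist across user preferences, products, and campaign strategies. To handle the inter-dependency of tensor modes, we introduce an online variational algorithm with a mean-field approximation. We empirically confirm that the proposed TensorUCB algorithm achieves a significant improvement over the baselines in influence maximization tasks, which is attributable to its ability to capture user-product heterogeneity.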
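For readers unfamiliar with contextual bandits, the following is a minimal sketch of a plain linear UCB arm with a single context vector. It is not the TensorUCB algorithm from the abstract, which uses a multi-mode tensor and an online variational update; the class name `LinUCBArm`, the unit ridge prior, and the exploration weight `alpha` are assumptions chosen for illustration.

```python
import math


class LinUCBArm:
    """One arm of a standard linear UCB bandit (illustrative, not TensorUCB)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # exploration weight (assumed value)
        # Inverse of the regularized Gram matrix A = I + sum of x x^T.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def score(self, x):
        """Upper confidence bound: estimated reward plus an uncertainty bonus."""
        theta = self._matvec(self.A_inv, self.b)
        Ax = self._matvec(self.A_inv, x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(a * xi for a, xi in zip(Ax, x)))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Rank-one update of A_inv via the Sherman-Morrison formula."""
        Ax = self._matvec(self.A_inv, x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


arm = LinUCBArm(dim=2)
initial = arm.score([1.0, 0.0])   # no data yet: the score is pure exploration bonus
arm.update([1.0, 0.0], reward=1.0)
seen = arm.score([1.0, 0.0])      # direction that has been observed once
unseen = arm.score([0.0, 1.0])    # direction that has never been observed
```

The bonus term shrinks along directions that have been observed, which is the exploration mechanism that UCB-style methods share; how TensorUCB computes the corresponding quantity for tensor-structured contexts is not specified in the abstract.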
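In the past few years, measuring the heart pulse from facial video has become an interesting pursuit for research. This is mainly due to the growing importance of obtaining an individual's heart rate in a non-invasive manner, which can be highly useful for applications in the gaming and medical industries. Another instrumental area of research in the past few years has been the advent of deep learning and the use of deep neural networks to enhance task performance. In this work, we propose to use efficient convolutional networks to accurately measure a user's heart rate from low-resolution facial videos. Furthermore, to ensure that we are able to obtain the heart rate in real time, we compress the deep learning model by pruning it, thereby reducing its memory footprint. We benchmark the performance of our approach on the MAHNOB dataset and compare it against multiple approaches.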
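The abstract above does not say which pruning scheme is used. As one common possibility, the sketch below shows unstructured magnitude pruning on a flat list of weights; the function name `magnitude_prune` and the `sparsity` parameter are assumptions for illustration, not the authors' method.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitudes.

    Illustrative only: the paper's actual pruning criterion is not stated.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices of the k weights closest to zero.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


pruned = magnitude_prune([0.9, -0.05, 0.3, 0.01, -0.7, 0.2], sparsity=0.5)
```

Zeroing weights reduces the memory footprint only when the result is stored in a sparse format or when whole filters are removed (structured pruning), so a real deployment would need one of those steps as well.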
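Estimating heart rate (HR) from facial videos has many applications in the medical and fitness industries, and it has also become useful in gaming. Several approaches have been proposed to seamlessly obtain heart rate from facial videos, but they have difficulty handling motion and illumination artifacts. In this work, we propose a reliable HR estimation framework that uses the spectral reflectance of the user, which makes it robust to motion and illumination disturbances. We employ deep learning-based frameworks such as Faster RCNNs to perform face detection, as opposed to the Viola-Jones algorithm employed by previous approaches. We evaluate our method on the MAHNOB HCI dataset and find that the proposed method outperforms previous approaches.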